{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploratory Data Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import time\n",
    "import torch\n",
    "from torch.utils.data import DataLoader, Dataset\n",
    "from torchvision.utils import make_grid\n",
    "from torchvision import transforms\n",
    "from collections import defaultdict\n",
    "from torchvision.datasets.folder import pil_loader\n",
    "import random\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import pylab\n",
    "from skimage import io, transform\n",
    "\n",
    "pd.set_option('max_colwidth', 800)\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have seven categories of musculoskeletal radiographs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "train_df = pd.read_csv('../MURA-v1.0/train.csv', names=['Path', 'Label'])\n",
    "valid_df = pd.read_csv('../MURA-v1.0/valid.csv', names=['Path', 'Label'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's checkout the shapes of dataframes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df.shape, valid_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have 37111 radiographs for training and 3225 radiographs for validation set, let's peak into the dataframes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "valid_df.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, we have radiograph paths and their correspoinding labels, each radiographs has a label of 0 (normal) or 1 (abnormal)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "According to paper: <br>\n",
    "1.\n",
    "\n",
    "    The MURA abnormality detection task is a binary classification task, where the input is an upper \n",
    "    exremity radiograph study — with each study containing one or more views (images) — and the \n",
    "    expected output is a binary label y ∈ {0, 1} indicating whether the \"study\" is normal or abnormal, \n",
    "    respectively.\n",
    "2.\n",
    "\n",
    "    The model takes as input one or more views for a study of an upper extremity. On each view, our 169-\n",
    "    layer convolutional neural network predicts the probability of abnormality. We compute the overall \n",
    "    probability of abnormality for the study by taking the arithmetic mean of the abnormality \n",
    "    probabilities output by the network for each image. The model makes the binary prediction of \n",
    "    abnormal if the probability of abnormality for the study is greater than 0.5."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, we have make predictions on study level, taking into account the predictions of all the views (images) of the study. This can be done by taking arithmetic mean of all the views (images) under a particular study."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df.head(30)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Analyzing this dataframe, we can see that images are annotated based on whether their corresponding study is positive (normal, 0) or negative (abnormal, 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Plot some random radiographs from training and validation set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "train_mat = train_df.as_matrix()\n",
    "valid_mat = valid_df.as_matrix()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ix = np.random.randint(0, len(train_mat)) # randomly select a index\n",
    "img_path = train_mat[ix][0]\n",
    "plt.imshow(io.imread(img_path), cmap='binary')\n",
    "cat = img_path.split('/')[2] # get the radiograph category\n",
    "plt.title('Category: %s & Lable: %d ' %(cat, train_mat[ix][1]))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ix = np.random.randint(0, len(valid_mat))\n",
    "img_path = valid_mat[ix][0]\n",
    "plt.imshow(io.imread(img_path), cmap='binary')\n",
    "cat = img_path.split('/')[2]\n",
    "plt.title('Category: %s & Lable: %d ' %(cat, valid_mat[ix][1]))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This can be seen that images vary in resolution and dimension"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# look at the pixel values\n",
    "io.imread(img_path)[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Exploration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!ls ../MURA-v1.0/train/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!ls ../MURA-v1.0/train/XR_ELBOW/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, train dataset has seven study types, each study type has studies on patients stored in folders named like patient001, patient002 etc.."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Patient count per study type"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's count number of patients in each study type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data_cat= ['train', 'valid']\n",
    "study_types = list(os.walk('../MURA-v1.0/train/'))[0][1] # study types, same for train and valid sets\n",
    "patients_count = {}  # to store all patients count for each study type, for train and valid sets\n",
    "for phase in data_cat:\n",
    "    patients_count[phase] = {}\n",
    "    for study_type in study_types:\n",
    "        patients = list(os.walk('../MURA-v1.0/%s/%s' %(phase, study_type)))[0][1] # patient folder names\n",
    "        patients_count[phase][study_type] = len(patients)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(study_types)\n",
    "print()\n",
    "print(patients_count)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the patient counts per study type \n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10, 5))\n",
    "for i, phase in enumerate(data_cat):\n",
    "    counts = patients_count[phase].values()\n",
    "    m = max(counts)\n",
    "    for i, v in enumerate(counts):\n",
    "        if v==m: ax.text(i-0.1, v+3, str(v))\n",
    "        else: ax.text(i-0.1, v + 20, str(v))\n",
    "    x_pos = np.arange(len(study_types))\n",
    "    plt.bar(x_pos, counts, alpha=0.5)\n",
    "    plt.xticks(x_pos, study_types)\n",
    "\n",
    "plt.xlabel('Study types')\n",
    "plt.ylabel('Number of patients')\n",
    "plt.legend(['train', 'valid'])\n",
    "plt.show()\n",
    "fig.savefig('images/pcpst.jpg', bbox_inches='tight', pad_inches=0) # name=patient count per study type"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "XR_FINGER has got the most number of patients (1867 in train set, 166 in valid set) followed by XR_WRIST"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Study count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Patients might have multiple studies for a given study type, like a patient may have two studies for wrist, independent of each other. <br> Let's have a look at such cases, **NOTE** here study count = number of patients which have same number of studies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# let's find out number of studies per study_type\n",
    "study_count = {} # to store study counts for each study type \n",
    "for study_type in study_types:\n",
    "    BASE_DIR = '../MURA-v1.0/train/%s/' % study_type\n",
    "    study_count[study_type] = defaultdict(lambda:0) # to store study count for current study_type, initialized to 0 by default\n",
    "    patients = list(os.walk(BASE_DIR))[0][1] # patient folder names\n",
    "    for patient in patients:\n",
    "        studies = os.listdir(BASE_DIR+patient)\n",
    "        study_count[study_type][len(studies)] += 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "study_count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "XR_WRIST has 3111 patients who have only single study, similarly, 158 patients have 2 studies, 12 patients have 3 studies and 4 patients have 4 studies. <br> let's plot this data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the study count vs number of patients per study type data \n",
    "fig = plt.figure(figsize=(8, 25))\n",
    "for i, study_type in enumerate(study_count):\n",
    "    ax = fig.add_subplot(7, 1, i+1)\n",
    "    study = study_count[study_type]\n",
    "    # text in the plot\n",
    "    m = max(study.values())\n",
    "    for i, v in enumerate(study.values()):\n",
    "        if v==m: ax.text(i-0.1, v - 200, str(v))\n",
    "        else: ax.text(i-0.1, v + 200, str(v))\n",
    "    ax.text(i, m - 100, study_type, color='green')\n",
    "    # plot the bar chart\n",
    "    x_pos = np.arange(len(study))\n",
    "    plt.bar(x_pos, study.values(), align='center', alpha=0.5)\n",
    "    plt.xticks(x_pos,  study.keys())\n",
    "    plt.xlabel('Study count')\n",
    "    plt.ylabel('Number of patients')\n",
    "plt.show()\n",
    "fig.savefig('images/pcpsc.jpg', bbox_inches='tight', pad_inches=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Number of views per study"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It can be seen that each study may have more that one view (radiograph image), let' have a look"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# let's find out number of studies per study_type\n",
    "view_count = {} # to store study counts for each study type, study count = number of patients which have similar number of studies \n",
    "for study_type in study_types:\n",
    "    BASE_DIR = '../MURA-v1.0/train/%s/' % study_type\n",
    "    view_count[study_type] = defaultdict(lambda:0) # to store study count for current study_type, initialized to 0 by default\n",
    "    patients = list(os.walk(BASE_DIR))[0][1] # patient folder names\n",
    "    for patient in patients:\n",
    "        studies = os.listdir(BASE_DIR + patient)\n",
    "        for study in studies:\n",
    "            views = os.listdir(BASE_DIR + patient + '/' + study)\n",
    "            view_count[study_type][len(views)] += 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "view_count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`XR_SHOULDER` has as many as 13 views in some studies, `XR_HAND` has 5 at max, this poses a challenging task to predict on a study taking into account all the views of that study while keeping the batch size of 8 (as mentioned in MURA paper)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the view count vs number of studies per study type data \n",
    "fig = plt.figure(figsize=(10, 30))\n",
    "for i, view_type in enumerate(view_count):\n",
    "    ax = fig.add_subplot(7, 1, i+1)\n",
    "    view = view_count[view_type]\n",
    "    # text in the plot\n",
    "    m = max(view.values())\n",
    "    for i, v in enumerate(view.values()):\n",
    "        if v==m: ax.text(i-0.1, v - 200, str(v))\n",
    "        else: ax.text(i-0.1, v + 80, str(v))\n",
    "    ax.text(i - 0.5, m - 80, view_type, color='green')\n",
    "    # plot the bar chart\n",
    "    x_pos = np.arange(len(view))\n",
    "    plt.bar(x_pos, view.values(), align='center', alpha=0.5)\n",
    "    plt.xticks(x_pos,  view.keys())\n",
    "    plt.xlabel('Number of views')\n",
    "    plt.ylabel('Number of studies')\n",
    "plt.show()\n",
    "fig.savefig('images/nsvc.jpg', bbox_inches='tight', pad_inches=0) # name=number of studies view count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most of the studies contain 2, 3 or 4 views"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
